Bump VMs, to Ubuntu 2204 with cgroups v1 #14972
654d5b6
to
c07a067
Compare
Ubuntu (cgroups v1, runc) is failing on restore with what looks like the same issue as checkpoint-restore/criu#1935:
# bin/podman --runtime runc --storage-driver vfs run -d --name foo quay.io/libpod/testimage:20220615 top
186bb39301e62a9bf3376a6b3ef0fcd77f268ce78cd1fca94fd095062679893b
# bin/podman --runtime runc --storage-driver vfs container checkpoint foo
186bb39301e62a9bf3376a6b3ef0fcd77f268ce78cd1fca94fd095062679893b
# bin/podman --runtime runc --storage-driver vfs container restore foo
Error: OCI runtime error: runc: criu failed: type NOTIFY errno 0
log file: /var/lib/containers/storage/vfs-containers/186bb39301e62a9bf3376a6b3ef0fcd77f268ce78cd1fca94fd095062679893b/userdata/restore.log
...
(00.070611) pie: 1: Preadv 0x55aa68fd5000:4096... (7 iovs)
(00.070677) pie: 1: `- returned 65536
(00.070680) pie: 1: `- skip pagemap
(00.070682) pie: 1: `- skip pagemap
(00.070685) pie: 1: `- skip pagemap
(00.070687) pie: 1: `- skip pagemap
(00.070690) pie: 1: `- skip pagemap
(00.070692) pie: 1: `- skip pagemap
(00.070695) pie: 1: `- skip pagemap
(00.070757) Error (criu/cr-restore.c:1492): 219029 stopped by signal 11: Segmentation fault
(00.071136) mnt: Switching to new ns to clean ghosts
(00.071492) Error (criu/cr-restore.c:2447): Restoring FAILED.
@adrianreber any advice on how to get criu rebuilt for Ubuntu? |
Confirming evidence, I think:
# grep -R RSEQ_SIG /usr/include
/usr/include/x86_64-linux-gnu/bits/rseq.h:/* RSEQ_SIG is a signature required before each abort handler code.
/usr/include/x86_64-linux-gnu/bits/rseq.h: RSEQ_SIG is used with the following reserved undefined instructions, which
/usr/include/x86_64-linux-gnu/bits/rseq.h:#define RSEQ_SIG 0x53053053
|
@rst0git knows how to update the obs and launchpad packages |
@edsantiago I pushed an update for version 3.17.1 in OBS. |
@rst0git thank you! I see what looks like an ubuntu 2204 log indicating that it succeeded (assuming 'xUbuntu' == 'Ubuntu'). Do you have a sense for how long that will then take to get into standard Ubuntu repos? |
A simple test with an Ubuntu 22.04 container indicates that the criu package has been updated:
|
Thanks again. I've restarted the VM build, which will take a few hours; then I'll need to resubmit this PR using those images (plus some runc fixes); that too will take a few hours. Progress! |
c07a067
to
90835b2
Compare
Sigh. nope. I'll try again tomorrow. |
90835b2
to
b5612df
Compare
Failing in pod create --share-parent test:
The string in question is something to do with cgroups. @cdoern this is your code, could you spare some cycles to tell me how to fix it? The problem here is that it's failing in cgroupsv1 with runc, because we haven't been testing runc. TIA. |
Failing in Remote build .containerignore filtering embedded directory with a timeout in the @jwhonce this is your code, could you please tell me how to fix it? Like, for instance, is it absolutely necessary for the file to be |
@containers/podman-maintainers cry for help: all checkpoint tests are hanging in container environment. Everything that runs I can't reproduce. I've tried:
$ sudo bin/podman run --rm --privileged --cgroupns=host -v $(pwd):/home/podman -v /dev/fuse:/dev/fuse -it quay.io/libpod/fedora_podman:c6706201604915200 bash
[root@9fcf6ce7a5a1 podman]# pm() { /home/podman/bin/podman --network-backend netavark --storage-driver vfs --cgroup-manager cgroupfs --events-backend file "$@"; }
[root@9fcf6ce7a5a1 podman]# pm all-sorts-of-things
Maybe I'm missing some magic option. Two requests:
Right now that fails with "command timed out but I'm not going to tell you what command it was nor what its output was". I think it might be slightly more helpful to say "this was the command that timed out, and this was the output I got up to this point". I can't figure it out from the Ginkgo docs, and have spent much too long on it already. Thank you! |
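For illustration only, the kind of timeout report requested above can be sketched in shell. This is not the Ginkgo code under discussion: `run_with_timeout` is a hypothetical helper, and it assumes GNU coreutils `timeout`, which exits with status 124 when the time limit is hit.

```shell
# Hypothetical helper (not part of podman's test suite).
# Runs a command under coreutils `timeout`, capturing stdout and stderr.
# On timeout it prints the output collected so far, then the timeout
# value and the command string.
run_with_timeout() {
    local secs="$1"
    shift
    local out rc
    out=$(timeout "$secs" "$@" 2>&1)
    rc=$?
    if [ "$rc" -eq 124 ]; then
        # 124 is the status GNU `timeout` returns when the limit is reached
        printf 'output so far:\n%s\n' "$out" >&2
        printf 'timed out after %ss: %s\n' "$secs" "$*" >&2
        return 124
    fi
    printf '%s\n' "$out"
    return "$rc"
}
```

Calling `run_with_timeout 90 some-command args` would then report both the partial output and the offending command if the limit is reached, instead of a bare "command timed out".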
b5612df
to
cba9c8e
Compare
Sigh, another one. @containers/podman-maintainers who owns the gitlab tests? They're failing hard:
...and I bet a nickel it's because of the switch go |
Another new one having to do with checkpoint/restore and pods, I think? log. Symptom:
|
And another hard failure, timed out waiting for port XXXX, ubuntu remote only:
Seems networking-related. Are netavark et al up-to-date on Ubuntu? |
Another sigh. containerized failed again, with checkpoint hangs as expected. But I instrumented the timeout code so it would dump stdout/stderr... and nothing. |
Good news is, |
I don't know anything about the gitlab test but go 1.18 uses module mode by default and without go.mod this fails. The correct way to install things is |
@edsantiago This also pins the version to prevent breaking PRs when the tool is updated:
diff --git a/contrib/cirrus/setup_environment.sh b/contrib/cirrus/setup_environment.sh
index 4952f8dd2..e2d0c655a 100755
--- a/contrib/cirrus/setup_environment.sh
+++ b/contrib/cirrus/setup_environment.sh
@@ -351,7 +351,7 @@ case "$TEST_FLAVOR" in
slug="gitlab.com/gitlab-org/gitlab-runner"
helper_fqin="registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-latest-pwsh"
ssh="ssh $ROOTLESS_USER@localhost -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o CheckHostIP=no env GOPATH=$GOPATH"
- showrun $ssh go get -u github.com/jstemmer/go-junit-report
+ showrun $ssh go install github.com/jstemmer/go-junit-report/[email protected]
showrun $ssh git clone https://$slug $GOPATH/src/$slug
showrun $ssh make -C $GOPATH/src/$slug development_setup
showrun $ssh bash -c "'cd $GOPATH/src/$slug && GOPATH=$GOPATH go get .'"
Although I guess it would make more sense to directly install this into the VM images to reduce downloads and compile time at runtime. |
Likely still some parallel cni and netavark use? |
Well, the failure is 100% and it's been happening since my first iteration on this PR. That is: I've never seen it pass. That suggests something harder than a race condition. |
I am not saying it is a race, just that something is still using CNI while the e2e test use netavark, check the iptables output to be sure. |
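One way to act on the "check the iptables output" suggestion is sketched below. `classify_backends` is a hypothetical helper, and it rests on an assumption not stated in this thread: that CNI-created chains are prefixed `CNI-` and netavark-created chains start with `NETAVARK`. It reads `iptables-save` style output on stdin.

```shell
# Hypothetical helper: reports which network backends appear to have
# created iptables chains, based on chain declaration lines (":NAME ...").
# Assumption: CNI chains start with "CNI-", netavark chains with "NETAVARK".
classify_backends() {
    local rules cni=no nv=no
    rules=$(cat)
    if printf '%s\n' "$rules" | grep -q '^:CNI-'; then
        cni=yes
    fi
    if printf '%s\n' "$rules" | grep -q '^:NETAVARK'; then
        nv=yes
    fi
    echo "cni=$cni netavark=$nv"
}
```

On a test VM this could be fed with `iptables-save | classify_backends` (as root); a result of `cni=yes netavark=yes` would be consistent with both backends being in use at once.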
cba9c8e
to
5bb4975
Compare
Getting my test writing skills ready for whatever 14867 introduced... I think I avoided most of the issues. |
@cdoern your PR includes lots of "bps" strings, so I'm guessing all these new bps failures are related. I'm outta here for the day (LONG DAY) but would appreciate your help. TIA. |
0e600b3
to
6823f5c
Compare
sorry about that, the good news is these failures are cropping up with both iops and bps devices so I can narrow it down. will keep you updated |
Don't sweat it. I really want to get this merged, because I don't want to be here on Friday, so I've resubmitted with |
...and enable the at-test-time confirmation, the one that double-checks that if CI requests runc we actually use runc. This exposed a nasty surprise in our setup: there are steps to define $OCI_RUNTIME, but that's actually a total fakeout! OCI_RUNTIME is used only in e2e tests, it has no effect whatsoever on actual podman itself as invoked via command line such as in system tests. Solution: use containers.conf
Given how fragile all this runtime stuff is, I've also added new tests (e2e and system) that will check $CI_DESIRED_RUNTIME.
Image source: containers/automation_images#146
Since we haven't actually been testing with runc, we need to fix a few tests:
- handle an error-message change (make it work in both crun and runc)
- skip one system test, "survive service stop", that doesn't work with runc and I don't think we care.
...and skip a bunch, filing issues for each:
- containers#15013 pod create --share-parent
- containers#15014 timeout in dd
- containers#15015 checkpoint tests time out under $CONTAINER
- containers#15017 networking timeout with registry
- containers#15018 restore --pod gripes about missing --pod
- containers#15025 run --uidmap broken
- containers#15027 pod inspect cgrouppath broken
- ...and a bunch more ("podman pause") that probably don't even merit filing an issue.
Also, use /dev/urandom in one test (was: /dev/random) because the test is timing out and /dev/urandom does not block. (But the test is still timing out anyway, even with this change)
Also, as part of the VM switch we are now using go 1.18 (up from 1.17) and this broke the gitlab tests. Thanks to @Luap99 for a quick fix.
Also, slight tweak to containers#15021: include the timeout value, and reword message so command string is at end.
Also, fixed a misspelling in a test name.
Fixes: containers#14833
Signed-off-by: Ed Santiago <[email protected]>
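The at-test-time confirmation described in this commit message might look roughly like the sketch below. It is not the test that was merged: `check_runtime` is a hypothetical function, and only the variable name `CI_DESIRED_RUNTIME` comes from the commit message. The comparison is kept separate from the podman call so the logic can be tried without podman installed.

```shell
# Hypothetical check: compares the runtime CI asked for with the runtime
# podman reports. Only the name CI_DESIRED_RUNTIME is from the commit message.
check_runtime() {
    local desired="$1" actual="$2"
    if [ -z "$desired" ]; then
        echo "skip: no desired runtime set"
        return 0
    fi
    if [ "$actual" = "$desired" ]; then
        echo "ok: using $actual"
        return 0
    fi
    echo "FAIL: CI requested '$desired' but podman reports '$actual'" >&2
    return 1
}

# Possible invocation on a machine with podman installed:
#   check_runtime "$CI_DESIRED_RUNTIME" \
#       "$(podman info --format '{{.Host.OCIRuntime.Name}}')"
```

Asking podman itself which runtime it uses, rather than trusting an environment variable, is what catches the "OCI_RUNTIME has no effect on the command line" surprise.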
6823f5c
to
0a160fe
Compare
LGTM
Restarted the flake. Thanks everybody!
/lgtm |
Thanks everyone. There may be in-flight PRs that break cgroups V1, such that once they merge they will break CI for everyone. Please consider that during your PR reviews over the next week, and please suggest that everyone rebase. |
Thanks @edsantiago for your efforts here, so much fun isn't it 😉 Seriously though, I'm happy to see you successfully navigated building new VM images and integrating them into podman CI. I know there's some testing fallout from this merging, however I don't believe I got an answer to my update question: Could you help me update these VM images in c/skopeo, c/image, and c/storage? I've got one going for buildah already. |
@cevich I'm sorry, I seem to have missed that question, and can't even find it in my email archives. If I understand your reference correctly, you would like me to submit If I misunderstood, could you please point me at your original request, or help me understand? Again, I'm sorry for not being able to find it. |
Yes please. There's no need to wait on 157, it doesn't change anything in the VM images, only a few tooling containers. There are 14 repos total that use GCE images, and probably 8 desperately in need of an update. Most of the time once podman CI is passing, it goes pretty smoothly, though the Ubuntu version and runc update could cause some hiccups. So help with this is very much appreciated.
No worries, comments get eaten by github sometimes, and it's not like we receive a small amount of github mail. |
Well, we need to rebuild VMs anyway (#15025), so I'll wait until that's fixed. I tried Re-run on your PR, it still fails (same criu bug on ubuntu), so no point in doing anything until that's fixed. |
Ugh, -sigh-, okay, I guess I thought it would be simple this time 😢 Okay, I'll /hold all the PR's I already opened. |
Two fixes done in containers#14972 (the "oops test under runc again" PR which was not backported into 4.2):
- "survive service stop" - skip. Test is only applicable under crun.
- "volume exec/noexec" - update the expected error message
One hail-mary fix for a test failure seen in RHEL87 gating:
- "nonexistent labels" - slight tweak to expected error message
None of these fixes will actually be tested in CI, because v4.2 does not run any runc tests. We'll have to wait and see what happens on the next RHEL build.
Signed-off-by: Ed Santiago <[email protected]>
This exposed a nasty bug in our system-test setup: Ubuntu (runc) was writing a scratch containers.conf file, and setting CONTAINERS_CONF to point to it. This was well-intentionedly introduced in containers#10199 as part of our long sad history of not testing runc. What I did not understand at that time is that CONTAINERS_CONF is **dangerous**: it does not mean "I will read standard containers.conf and then override", it means "I will **IGNORE** standard containers.conf and use only the settings in this file"! So on Ubuntu we were losing all the default settings: capabilities, sysctls, all.
Yes, this is documented in containers.conf(5) but it is such a huge violation of POLA that I need to repeat it.
In containers#14972, as yet another attempt to fix our runc crisis, I introduced a new runc-override mechanism: create a custom /etc/containers/containers.conf when OCI_RUNTIME=runc. Unlike the CONTAINERS_CONF envariable, the /etc file actually means what you think it means: "read the default file first, then override with the /etc file contents". I.e., we get the desired defaults. But I didn't remember this helpers.bash workaround, so our runc testing has actually been flawed: we have not been testing with the system containers.conf.
This commit removes the no-longer-needed and never-actually-wanted workaround, and by virtue of testing the cap-drops in kube generate, we add a regression test to make sure this never happens again.
It's a little scary that we haven't been testing capabilities.
Also scary: this PR requires python, for converting yaml to json. I think that should be safe: python3 'import yaml' and 'json' works fine on a RHEL8.7 VM from 1minutetip.
Signed-off-by: Ed Santiago <[email protected]>
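A minimal sketch of the override mechanism this commit message describes, assuming the override only needs to select the runtime: `write_runtime_override` is a hypothetical helper, not the code in the CI scripts. It relies on the `runtime` key in the `[engine]` table of containers.conf. The destination path is a parameter so the sketch can be tried in a scratch directory.

```shell
# Hypothetical helper: writes a minimal containers.conf that selects an
# OCI runtime. Per the commit message, CI would place this at
# /etc/containers/containers.conf, where it is merged on top of the defaults.
write_runtime_override() {
    local runtime="$1" conf="$2"
    mkdir -p "$(dirname "$conf")"
    printf '[engine]\nruntime = "%s"\n' "$runtime" > "$conf"
}

# Possible invocation (as root) on a CI VM:
#   write_runtime_override runc /etc/containers/containers.conf
```

The choice of the /etc file over the CONTAINERS_CONF variable follows the commit message: the variable replaces the standard configuration entirely, while the /etc file only overrides the keys it sets.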